A 9.5 mW 330 μsec 1024-Point FFT Processor

Author

  • Bevan M. Baas
Abstract

This paper presents an energy-efficient, single-chip, 1024-point FFT processor. The full-custom, 460,000-transistor design has been fabricated in a standard 0.7 μm (L_poly = 0.6 μm) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 μs at a clock speed of 16 MHz while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than that of the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz.

Introduction

The Fast Fourier Transform (FFT) is one of the most widely used digital signal processing algorithms. While advances in semiconductor processing technology have enabled the performance and integration of FFT processors to increase steadily, these advances have unfortunately also led to an increase in power consumption. As a result, the number of potential FFT applications that are limited by power rather than performance (e.g., portable applications) is significant and growing.

For many CMOS circuits, energy consumption is proportional to the square of the supply voltage [1]. Consequently, tremendous efficiency can be gained by aggressively reducing the supply voltage. Unfortunately, circuit performance also falls at lower supply voltages. The processor presented here is designed to operate with a low supply voltage, Vdd, which approaches the value of the transistor thresholds, Vt, to dramatically increase the overall energy efficiency. To regain some of the lost performance, the processor uses a high-performance algorithm and architecture that is shown to perform better than previous designs.

Processor Architecture

As with most DSP algorithms, FFTs are very memory-intensive. FFTs are calculated in O(log N) stages, where N is the length of the transform, and each stage requires the reading and writing of all N data words.
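The supply-voltage argument and the per-stage memory traffic described above can be checked with a little arithmetic. The following is a minimal sketch, assuming the first-order model in which dynamic energy scales with Vdd squared (it ignores leakage and other effects) and assuming a radix-2 decomposition in which every stage touches all N words. The function names are hypothetical and chosen only for illustration.

```python
import math

def relative_dynamic_energy(vdd, vdd_ref):
    """Energy per operation at vdd relative to vdd_ref, assuming E ~ Vdd^2."""
    return (vdd / vdd_ref) ** 2

def energy_per_transform_uj(power_mw, time_us):
    """Average power (mW) times transform time (us) gives nJ; divide by 1000 for uJ."""
    return power_mw * time_us / 1000.0

def words_moved(n, radix=2):
    """Word reads (and, equally, word writes) if every stage touches all n words."""
    stages = round(math.log(n, radix))
    return stages * n

# Figures quoted in the abstract: 1.1 V vs 3.3 V, 9.5 mW, 330 us, N = 1024.
print(relative_dynamic_energy(1.1, 3.3))
print(energy_per_transform_uj(9.5, 330.0))
print(words_moved(1024))
```

Under these assumptions, running at 1.1 V instead of 3.3 V costs roughly one ninth of the energy per operation, one transform uses about 3.1 μJ, and an uncached radix-2 computation would read and write 10,240 words each.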
To maintain good performance, many previous "longer-length" (N ≥ 1024) FFT processor designs used multiple datapaths and large crossbar, bus, or network structures connected to a partitioned memory. To avoid this interconnection bottleneck, the chip presented here implements a data-caching algorithm which provides increased energy efficiency (by reducing communication energy) and increased performance (through deep pipelining). Figure 1 is a high-level block diagram of the system showing the tightly coupled processor-cache pair and the N-word main memory.

[Figure 1: System block diagram (Processor, Cache, Main Memory)]

It is well known that data caches increase the effective bandwidth to a memory, but only if the memory access pattern exhibits a fair amount of locality. Although nearly all FFT algorithms have very poor locality, an algorithm is described in [2] which offers good locality over large portions of the computation. The global communication inherent in the FFT is concentrated into a few (typically 1 or 2) intermediate steps and is easily accomplished through appropriate addressing when filling and flushing the cache. Because the FFT algorithm is deterministic, cache tags are unnecessary and correct cache operation is achieved through predetermined cache addressing and pre-fetching of data from main memory.

The Spiffee Processor

A 1024-point single-chip cached-architecture FFT processor named Spiffee was designed and fabricated. It operates on complex 36-bit (18-bit real + 18-bit imaginary) fixed-point data and has internal datapath widths varying between 20 and 24 bits. The processor was designed with two epochs (E = 2), so each word in main memory is read and written twice per transform. With N = 1024, the cache size C equals $\sqrt[E]{N} = \sqrt[2]{1024} = 32$ words. Although the architecture easily supports multiple processors, the chip presented here contains a single processor/cache pair and a single set of main memory.
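To make the epoch idea concrete, here is a minimal floating-point sketch of a cached radix-2 decimation-in-time FFT. It is not Spiffee's implementation: the chip uses fixed-point arithmetic, and the addressing scheme below (contiguous groups of C words in the first epoch, stride-C groups in the second) is an assumption that happens to fit a radix-2 DIT flow graph. The names `cached_fft` and `bit_reverse_permute` are hypothetical. The function also counts word reads from main memory, so the "each word is read twice per transform" property can be observed directly.

```python
import cmath

def bit_reverse_permute(x):
    """Reorder x so that element i moves to the bit-reversed index (DIT input order)."""
    n = len(x)
    bits = n.bit_length() - 1
    return [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]

def cached_fft(x, epochs=2):
    """Radix-2 DIT FFT computed in `epochs` epochs using a C-word cache.

    Requires len(x) == C ** epochs with C a power of two.
    Returns (spectrum, number of word reads from main memory).
    """
    n = len(x)
    c = round(n ** (1.0 / epochs))          # cache size C = N^(1/E)
    if c ** epochs != n or c & (c - 1):
        raise ValueError("need len(x) == C**epochs with C a power of two")
    passes = c.bit_length() - 1             # log2(C) butterfly passes per cache load
    mem = bit_reverse_permute(list(x))      # stands in for main memory
    mem_reads = 0
    for e in range(epochs):
        stride = c ** e                     # address spacing of words sharing a cache load
        for hi in range(n // (c * stride)):
            for lo in range(stride):
                addrs = [hi * c * stride + lo + stride * k for k in range(c)]
                cache = [mem[a] for a in addrs]        # fill the cache
                mem_reads += c
                for p in range(passes):
                    h = 1 << p                         # butterfly span inside the cache
                    m = 2 * h * stride                 # sub-transform size in main-memory terms
                    for k in range(c):
                        if k & h:
                            continue                   # k is the upper input of a pair
                        w = cmath.exp(-2j * cmath.pi * (addrs[k] % m) / m)
                        a, b = cache[k], cache[k + h]
                        cache[k], cache[k + h] = a + b * w, a - b * w
                for k, a in enumerate(addrs):          # flush the cache
                    mem[a] = cache[k]
    return mem, mem_reads
```

With N = 1024 and two epochs, the sketch uses a 32-word cache and reads 2,048 words from main memory in total, compared with 10,240 for a radix-2 computation that goes to main memory at every stage.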
The data cache reduces traffic to main memory by a factor of $\log_r(N)/E$, which in this implementation is $\log_2(1024)/2 = 5$. This allows more processors to be added and/or a slower, lower-power main memory to be used. The power used to access data decreases because data words are stored in a smaller memory nearer to the datapath.

[Figure 2: Pipeline diagram. Stages: MEM READ, CROSSB, MULT1, MULT2, MULT3, ADD/SUB (CMULT), ADD/SUB (X, Y), CROSSB, MEM WRITE. Inputs A, B, W; outputs X = A + BW, Y = A - BW]

While higher-radix, prime-factor, and other FFT algorithms have been shown to require fewer operations than radix-2 [3], Spiffee was designed with a radix-2 decomposition. This is because, for a VLSI implementation, the regularity and simplicity of an algorithm are very important factors in determining the clock speed, design time, and other key parameters.

The processor's datapath calculates one complex radix-2 decimation-in-time butterfly [4] per cycle. This requires the calculation of two butterfly outputs, X and Y, from two butterfly inputs, A and B, and a complex coefficient W, using the equations $X = A + BW$ and $Y = A - BW$. In general, all variables are complex.

The datapath and cache are aggressively pipelined to retain high performance, as evidenced by the 9-stage pipeline diagram shown in Figure 2. In the first pipeline stage, A and B are read from the caches and W is read from a ROM. In stage two, the operands are routed through two 2 × 2 crossbars to the correct functional units. The four $B_{\{real,imag\}} \times W_{\{real,imag\}}$ multiplications of the real and imaginary components of B and W are calculated in stages three through five. Stage six completes the complex multiplication, stage seven performs the remaining additions or subtractions to calculate X and Y, and pipeline stages eight and nine complete the routing and write-back of the results. The deep pipeline causes a read-after-write data hazard once every 80 cycles, which is handled by stalling the pipeline for one cycle.
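The butterfly and the throughput figures above lend themselves to a short sketch. The first function builds the complex product BW from four real multiplications and then forms both outputs, mirroring the split between the multiplier stages and the add/subtract stages. The cycle estimate is a rough calculation under stated assumptions: one butterfly per cycle, one stall per 80 butterflies, and no allowance for pipeline fill or cache load latency, which the text does not quantify. All function names are hypothetical, and floating point is used in place of the chip's fixed-point arithmetic.

```python
import cmath

def butterfly(a, b, w):
    """Radix-2 DIT butterfly: X = A + B*W, Y = A - B*W, using four real multiplies."""
    rr = b.real * w.real
    ii = b.imag * w.imag
    ri = b.real * w.imag
    ir = b.imag * w.real
    bw = complex(rr - ii, ri + ir)   # completes the complex multiplication
    return a + bw, a - bw

def estimated_cycles(n=1024, stall_interval=80):
    """Butterfly cycles plus one assumed stall cycle per `stall_interval` butterflies."""
    butterflies = (n // 2) * (n.bit_length() - 1)   # (N/2) * log2(N) for N a power of two
    stalls = butterflies // stall_interval
    return butterflies + stalls

def estimated_time_us(cycles, clock_mhz=16.0):
    """Cycles divided by clock frequency in MHz gives microseconds."""
    return cycles / clock_mhz

w = cmath.exp(-2j * cmath.pi / 8)
print(butterfly(1 + 2j, 3 - 1j, w))
print(estimated_cycles(), estimated_time_us(estimated_cycles()))
```

For N = 1024 this gives 5,120 butterflies and 64 stall cycles, or 324 μs at 16 MHz, which is close to the reported 330 μs. The remaining gap is not accounted for by this estimate.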
Figure 3 is a block diagram of the chip and Figure 4 is the corresponding die microphotograph. Four signed, pipelined array multipliers produce 24-bit products from 20-bit operands. They use Booth-2 encoding and reduce partial products with (4,2) and (3,2) adders. Six single-cycle 24-bit adders and subtractors propagate carries using a hybrid of carry-lookahead and ripple techniques. The ROM is organized as two 256-word × 40-bit arrays and stores the complex W coefficients used in the FFT kernel calculations. The two sets of caches are designed so that one set can perform calculations while the other is being flushed and filled from memory. Each set is organized as two banks of 16-word × 40-bit dual-ported SRAM arrays using 10-transistor cells. The main memory is made up of eight 128-word × 36-bit SRAM arrays using 6-transistor cells. The full-custom design contains 460,000 transistors and was fabricated in a standard single-poly, triple...

Publication date: 1998